---
title: Lesson 6 - Introduction to persistent homology
keywords: fastai
sidebar: home_sidebar
summary: "A first look at persistent homology, the workhorse underlying topological machine learning."
description: "A first look at persistent homology, the workhorse underlying topological machine learning."
---
In general terms, persistent homology focuses on answering the question
As one increases a threshold, at what scale do we observe changes in some geometric representation of the data?
The answer to this question involves deep mathematics and will not be covered here. Perhaps the simplest way to understand persistent homology is in terms of growing balls around each point. The basic idea is that by keeping track of when the balls intersect, we can quantify when topological features like connected components and loops are "born" and when they "die".
For example, consider two noisy clusters and track their connectivity or "0-dimensional persistent homology" as we increase the radius of the balls around each point:

As the ball radius grows from 0 to infinity, 0-dimensional persistent homology records when a ball in one connected component first intersects a ball of a different connected component (denoted by a different colour in the animation). At radius 0, a connected component is born for each point; once any two balls touch, one connected component dies: one colour persists and the other vanishes. Each vanishing colour corresponds to a death, and therefore to another point being added to the persistence diagram.
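This bookkeeping for connected components can be sketched with single-linkage clustering: two growing balls first touch when the distance between their centres equals the ball diameter, so single-linkage merge heights play the role of death times (up to a radius-versus-diameter convention). A minimal sketch on made-up toy data, separate from the lesson's pipeline:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# toy 1D data: two tight pairs far apart (made-up values)
points = np.array([[0.0], [0.1], [5.0], [5.2]])

# single-linkage merge heights are the distances at which two
# connected components first meet, i.e. the H0 death times
Z = linkage(points, method="single")
deaths = np.sort(Z[:, 2])
print(deaths)  # deaths at 0.1, 0.2 and 4.9
```

Every point is born at radius 0, and each merge kills one component, so the 0-dimensional diagram here would contain the pairs (0, 0.1), (0, 0.2) and (0, 4.9).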
With 1-dimensional persistent homology, we still blow up balls around points but instead of tracking connected components, we track the birth and death of loops. Consider for example the visualisation below:

Note that in this example a persistent loop is born around a radius of 0.25 and dies around 0.6, giving rise to the outlier in the persistence diagram.
The above discussion generalises to higher dimensions: we grow balls in exactly the same way, but track the birth and death of higher-dimensional voids.
%load_ext autoreload
%autoreload 2
# data wrangling
import numpy as np
import pandas as pd
from pathlib import Path
# tda magic
from gtda.homology import VietorisRipsPersistence
from gtda.diagrams import PersistenceEntropy
from gtda.plotting import *
# ml tools
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
# dataviz
import matplotlib.pyplot as plt
def make_point_clouds(n_samples_per_shape, n_points, noise):
    circle_point_clouds = [
        np.asarray(
            [
                [np.sin(t) + noise * (np.random.rand(1)[0] - 0.5), np.cos(t) + noise * (np.random.rand(1)[0] - 0.5), 0]
                for t in range((n_points ** 2))
            ]
        )
        for kk in range(n_samples_per_shape)
    ]
    # label circles with 0
    circle_labels = np.zeros(n_samples_per_shape)
    sphere_point_clouds = [
        np.asarray(
            [
                [
                    np.cos(s) * np.cos(t) + noise * (np.random.rand(1)[0] - 0.5),
                    np.cos(s) * np.sin(t) + noise * (np.random.rand(1)[0] - 0.5),
                    np.sin(s) + noise * (np.random.rand(1)[0] - 0.5),
                ]
                for t in range(n_points)
                for s in range(n_points)
            ]
        )
        for kk in range(n_samples_per_shape)
    ]
    # label spheres with 1
    sphere_labels = np.ones(n_samples_per_shape)
    torus_point_clouds = [
        np.asarray(
            [
                [
                    (2 + np.cos(s)) * np.cos(t) + noise * (np.random.rand(1)[0] - 0.5),
                    (2 + np.cos(s)) * np.sin(t) + noise * (np.random.rand(1)[0] - 0.5),
                    np.sin(s) + noise * (np.random.rand(1)[0] - 0.5),
                ]
                for t in range(n_points)
                for s in range(n_points)
            ]
        )
        for kk in range(n_samples_per_shape)
    ]
    # label tori with 2
    torus_labels = 2 * np.ones(n_samples_per_shape)
    point_clouds = np.concatenate((circle_point_clouds, sphere_point_clouds, torus_point_clouds))
    labels = np.concatenate((circle_labels, sphere_labels, torus_labels))
    return point_clouds, labels
Figure reference: https://pythonawesome.com/semantic-and-instance-segmentation-of-lidar-point-clouds-for-autonomous-driving/
Techniques for analyzing 3D shapes are becoming increasingly important due to the vast number of sensors such as LiDAR that are capturing 3D data, as well as numerous computer graphics applications. These raw data are typically collected in the form of a point cloud, which corresponds to a set of 3D points $\{p_i | i = 1, \ldots, n\}$, where each point $p_i$ is a vector of its $(x, y, z)$ coordinates plus extra feature channels such as color, intensity etc. Typically Euclidean distance is used to calculate the distances between any two points.
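For instance, the pairwise Euclidean distances that persistent homology consumes can be computed directly from the raw coordinates. A toy sketch with made-up points:

```python
import numpy as np
from scipy.spatial.distance import cdist

# a toy point cloud of three 3D points (made-up coordinates)
cloud = np.array([[0.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 3.0, 4.0]])

D = cdist(cloud, cloud)  # (3, 3) matrix of pairwise Euclidean distances
print(D[0])  # distances from the first point: [0. 1. 5.]
```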
By finding suitable representations of these point clouds, machine learning can be used to solve a variety of tasks such as those shown in the figure below.
Figure reference: adapted from https://arxiv.org/abs/1612.00593
point_clouds, labels = make_point_clouds(n_samples_per_shape=10, n_points=20, noise=0.5)
point_clouds.shape, labels.shape
Here the labels are chosen so that a circle is 0, a sphere is 1, and a torus is 2. Each point cloud corresponds to a sampling of the continuous surface - in this case 400 points are sampled per shape. As a sanity check, let's visualise these point clouds:
# expect circle
plot_point_cloud(point_clouds[0])
# expect sphere
plot_point_cloud(point_clouds[10])
# expect torus
plot_point_cloud(point_clouds[-1])
In raw form, point cloud data is not well suited for most machine learning algorithms because we ultimately need a feature matrix $X$ where each row corresponds to a single sample (i.e. point cloud) and each column corresponds to a particular feature. In our example, each point cloud corresponds to a collection of 3-dimensional vectors, so we need some way of representing this information in terms of a single vector $x^{(i)}$ of feature values.
In this lesson we will use persistent homology to generate a topological summary of a point cloud in the form of a so-called persistence diagram.
# track connected components, loops, and voids
homology_dimensions = [0, 1, 2]
# calculating persistence is memory intensive - adjust n_jobs according to how many cores you have!
persistence = VietorisRipsPersistence(metric="euclidean", homology_dimensions=homology_dimensions, n_jobs=6)
%time persistence_diagrams = persistence.fit_transform(point_clouds)
Once we have computed our persistence diagrams, we can compare how the circle, sphere and torus produce different patterns at each homology dimension:
# circle
plot_diagram(persistence_diagrams[0])
# sphere
plot_diagram(persistence_diagrams[10])
# torus
plot_diagram(persistence_diagrams[-1])
Although persistence diagrams are useful descriptors of the data, they cannot be used directly for machine learning applications. This is because different persistence diagrams may have different numbers of points, and basic operations like the addition and multiplication of diagrams are not well-defined.
To overcome these limitations, a variety of proposals have been made to “vectorise” persistence diagrams via embeddings or kernels which are well-suited for machine learning. In giotto-tda, we provide access to the most common vectorisations via the gtda.diagrams module.
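As a rough sketch of the idea behind persistent entropy: normalise the lifetimes of the points in a diagram and take their Shannon entropy. The diagram below is made up, and the log base is one common convention; this sketch is for intuition rather than a reproduction of PersistenceEntropy's exact output:

```python
import numpy as np

# toy diagram of (birth, death) pairs (made-up values)
diagram = np.array([[0.1, 0.9], [0.2, 0.4]])

lifetimes = diagram[:, 1] - diagram[:, 0]  # [0.8, 0.2]
p = lifetimes / lifetimes.sum()            # normalised lifetimes
entropy = -np.sum(p * np.log2(p))          # Shannon entropy, base 2
print(entropy)  # ≈ 0.722
```

The result is a single number per diagram, so a batch of diagrams becomes a fixed-size feature matrix.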
For example, one such feature is known as persistent entropy and can be applied to our diagrams as follows:
persistent_entropy = PersistenceEntropy()
# calculate topological feature matrix
X = persistent_entropy.fit_transform(persistence_diagrams)
# expect shape - (n_point_clouds, n_dims)
X.shape
With topological features at hand, the next step is to train a classifier. Since our dataset is small, we will use a Random Forest and use the out-of-bag (OOB) score to estimate accuracy without setting aside a separate validation set:
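The OOB mechanism can be illustrated on synthetic data (a sketch using scikit-learn's make_classification, unrelated to our point clouds):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic classification problem (illustration only)
Xs, ys = make_classification(n_samples=200, random_state=0)

# each tree is fit on a bootstrap sample; oob_score_ scores every point
# using only the trees that did NOT see it during fitting
rf_demo = RandomForestClassifier(oob_score=True, random_state=0).fit(Xs, ys)
print(rf_demo.oob_score_)  # an accuracy estimate between 0 and 1
```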
labels.shape
data = np.concatenate((X,labels[:, None]), axis=1)
np.random.shuffle(data)
X, y = data[:,:3], data[:, 3]
rf = RandomForestClassifier(oob_score=True)
rf.fit(X, y)
# score
rf.oob_score_
from gtda.diagrams import *
dm = PairwiseDistance(n_jobs=6).fit_transform(persistence_diagrams)
plot_heatmap(dm, colorscale="viridis")
from sklearn.manifold import MDS
mds = MDS(n_components=3, dissimilarity="precomputed")
mds.fit_transform(dm)
XWass = mds.embedding_
plot_point_cloud(XWass)
point_clouds.shape
X = point_clouds.reshape(60, 400 * 3)
X.shape
X = point_clouds[-1]
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
red = pca.fit_transform(X)
red.shape
pca.components_
X_centered = X - np.mean(X, axis=0)
cov_matrix = np.dot(X_centered.T, X_centered) / len(X)
eigenvalues = pca.explained_variance_
for eigenvalue, eigenvector in zip(eigenvalues, pca.components_):
print(np.dot(eigenvector.T, np.dot(cov_matrix, eigenvector)))
print(eigenvalue)
print(eigenvector)
pca.components_
labels
X[0]
X_train, X_valid, y_train, y_valid = train_test_split(X, labels, test_size=0.1)
y_train.shape
rf = RandomForestClassifier().fit(X_train, y_train)
rf.score(X_valid, y_valid)
rf = RandomForestClassifier(oob_score=True)
rf.fit(X, labels)
rf.oob_score_
plot_point_cloud(point_clouds[2])
point_clouds.shape
circle = np.asarray([[np.sin(t), np.cos(t), 0] for t in range(400)])
plot_point_cloud(circle)
vr_persistence = VietorisRipsPersistence().fit_transform_plot([circle])
import glob
DATA = Path("../data/shapes")
DATA.name
fighters = np.asarray(
[
pd.read_csv(path, names=["x", "y", "z", "r", "g", "b"], usecols=["x", "y", "z"], sep=" ").sample(400).values
for path in DATA.rglob("*.pts")
if "fighter" in path.name
]
)
fighters.shape
vases = np.asarray(
[
pd.read_csv(path, names=["x", "y", "z", "r", "g", "b"], usecols=["x", "y", "z"], sep=" ").sample(400).values
for path in DATA.rglob("*.pts")
if "vase" in path.name
]
)
chairs = np.asarray(
[
pd.read_csv(path, names=["x", "y", "z", "r", "g", "b"], usecols=["x", "y", "z"], sep=" ").sample(400).values
for path in DATA.rglob("*.pts")
if "desk_chair" in path.name
]
)
vases.shape
shapes = np.concatenate((fighters, vases, chairs))
shapes.shape
X = shapes.reshape(30, 400 * 3)
labels = np.zeros(30)
labels[10:20] = 1
labels[20:] = 2
rf = RandomForestClassifier(oob_score=True, random_state=42)
rf.fit(X, labels)
rf.oob_score_
point_clouds = np.concatenate((chairs, vases))
point_clouds.shape
shapes = [
"biplane",
"desk_chair",
"dining_chair",
"fighter_jet",
"fish",
"flying_bird",
"guitar",
"handgun",
"head",
"helicopter",
"human",
"potted_plant",
"race_car",
"sedan",
"shelves",
"ship",
"sword",
"table",
"vase",
]
def plot_shape(shape):
    print(shape)
    filename = shape + "1.pts"
    df = pd.read_csv(DATA / filename, names=["x", "y", "z", "r", "g", "b"], usecols=["x", "y", "z"], sep=" ")
    plot_point_cloud(df.values)
index = 2
plot_shape(shapes[index])
df.shape
from gtda.plotting import *
df.iloc[:, :3]
plot_point_cloud(df.values)
Let's begin by generating 3D point clouds of circles, spheres and tori, with labels 0, 1 and 2 respectively. The noise argument displaces the points sampling each surface by a random amount in a random direction (here we set it to zero):
point_clouds, labels = make_point_clouds(1, 20, 0)
# check shapes
point_clouds.shape, labels.shape
Each point cloud corresponds to a sampling of the continuous surface - in this case 400 points are sampled per shape. As a sanity check, let's visualise these point clouds:
point_clouds = chairs
# expect desk chair
plot_point_cloud(point_clouds[0])
# expect desk chair
plot_point_cloud(point_clouds[-1])
The first step in our topological machine learning pipeline is to construct persistence diagrams from the training data.
For our application we will focus on point clouds, where for each point cloud our goal is to use persistent homology to generate a topological summary in the form of persistence diagrams. A fast and common way to calculate persistent homology is via the so-called Vietoris-Rips complex. The mathematical details are not important for this example, rather the main idea is that we instantiate the class VietorisRipsPersistence just like a scikit-learn estimator and then make use of the fit-transform paradigm to generate the diagrams:
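To make the Vietoris-Rips idea concrete: at scale $\epsilon$, an edge joins every pair of points at distance at most $\epsilon$, and higher simplices fill in the resulting cliques. A toy sketch of just the edge rule, with made-up points (giotto-tda handles all of this internally):

```python
import numpy as np
from itertools import combinations

# toy 2D points (made up): a right triangle with legs of length 1
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
eps = 1.2

# an edge (i, j) enters the Vietoris-Rips complex once ||p_i - p_j|| <= eps
edges = [
    (i, j)
    for i, j in combinations(range(len(points)), 2)
    if np.linalg.norm(points[i] - points[j]) <= eps
]
print(edges)  # [(0, 1), (0, 2)]: the hypotenuse (length ≈ 1.41) is still absent
```

Sweeping $\epsilon$ from 0 upwards and recording when features appear and disappear is exactly what VietorisRipsPersistence does for us.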
persistence = VietorisRipsPersistence()
persistence_dgms = persistence.fit_transform(point_clouds)
point_clouds[-2:].shape
# track connected components, loops, and voids
homology_dimensions = [0, 1, 2]
persistence = VietorisRipsPersistence(metric="euclidean", homology_dimensions=homology_dimensions, n_jobs=6)
%time persistence_diagrams = persistence.fit_transform(shapes)
Once we have computed our persistence diagrams, we can compare how the different shape classes produce different patterns at each homology dimension:
persistence_diagrams.shape
# fighter jet
plot_diagram(persistence_diagrams[0])
# vase
plot_diagram(persistence_diagrams[11])
# desk chair
plot_diagram(persistence_diagrams[-1])
Recall that for the idealised sphere and torus we saw a persistent void (H2) for the sphere and persistent loops (H1) for the torus; look for analogous signatures in the diagrams of these shape classes.
Although persistence diagrams are useful descriptors of the data, they cannot be used directly for machine learning applications. This is because different persistence diagrams may have different numbers of points, and basic operations like the addition and multiplication of diagrams are not well-defined.
To overcome these limitations, a variety of proposals have been made to “vectorise” persistence diagrams via embeddings or kernels which are well-suited for machine learning. In giotto-tda, we provide access to the most common vectorisations via the gtda.diagrams module.
For example, one such feature is known as persistent entropy and can be applied to our diagrams as follows:
persistent_entropy = PersistenceEntropy()
# calculate topological feature matrix
X = persistent_entropy.fit_transform(persistence_diagrams)
# expect shape - (n_point_clouds, n_dims)
X.shape
X[0]
With topological features at hand, the next step is to train a classifier. Since X is just like any feature matrix, we can create a train and validation set as usual:
labels = np.zeros(30)
labels[10:20] = 1
labels[20:] = 2
labels
X_train, X_valid, y_train, y_valid = train_test_split(X, labels)
The final step is to train and score a model - let's use a Random Forest and use the out-of-bag (OOB) score as a proxy for validation accuracy:
rf = RandomForestClassifier(oob_score=True)
rf.fit(X, labels)
# score
rf.oob_score_
It works!
rf = RandomForestClassifier()
cross_val_score(rf, X_train, y_train)
ts = np.load("data1samp-1.npy")
ts["data"].shape
ts.shape
ts[0][5].shape
plt.plot(ts[9][5])
def padrand(V, n, kr):
    cut = np.random.randint(n)
    rand1 = np.random.randn(cut)
    rand2 = np.random.randn(n - cut)
    out = np.concatenate((rand1 * kr, V, rand2 * kr))
    return out
RN = 20 # this is the number of SNR values generated
Rcoef = np.linspace(0.075, 0.65, RN)
Npad = 2000 # number of padding points on either side of the vector
gw = np.load("data1samp-1.npy")
Norig = len(gw["data"][0])
Ndat = len(gw["signal_present"])
ri = []
k = 0
nsamp = 2 # downsampling value, 2 means divide the sample rate by 2
N = int(Norig / nsamp)
Ndattotal = 100 # number of data elements to generate
ncoeff = []
Rcoeflist = []
for j in range(Ndattotal):
    ncoeff.append(10 ** (-19) * (1 / Rcoef[j % RN]))
    Rcoeflist.append(Rcoef[j % RN])
x = []
xsig = []
y = []
pcloud = []
maxs = 0.0
k = 0
y = np.zeros((Ndattotal, 2))
for j in range(Ndattotal):
    signal = gw["data"][j % Ndat][range(0, Norig, nsamp)]
    sigp = int((np.random.randn() < 0))
    noise = ncoeff[j] * np.random.randn(N)
    y[j, 0] = sigp
    y[j, 1] = 1 - sigp
    if sigp == 1:
        # generate the actual synthetic raw signal
        # by adding noise and padding with noise
        rawsig = padrand(signal + noise, Npad, ncoeff[j])
        if k == 0:
            k = 1
    else:
        rawsig = padrand(noise, Npad, ncoeff[j])
rawsig.shape
plt.plot(rawsig)
from PIL import Image
from gtda.images import *
cells_original = Image.open("../data/Cells.jpg").convert("L")
cells_original
from scipy import sparse
def lower_star_img(img):
    """
    Construct a lower star filtration on an image

    Parameters
    ----------
    img: ndarray (M, N)
        An array of single channel image data

    Returns
    -------
    sparseDM: sparse matrix (M * N, M * N)
        A sparse distance matrix encoding the sublevelset filtration,
        suitable for persistence computation with metric="precomputed"
    """
    m, n = img.shape
    idxs = np.arange(m * n).reshape((m, n))
    I = idxs.flatten()
    J = idxs.flatten()
    V = img.flatten()
    # Connect 8 spatial neighbors
    tidxs = np.ones((m + 2, n + 2), dtype=np.int64) * np.nan
    tidxs[1:-1, 1:-1] = idxs
    tD = np.ones_like(tidxs) * np.nan
    tD[1:-1, 1:-1] = img
    for di in [-1, 0, 1]:
        for dj in [-1, 0, 1]:
            if di == 0 and dj == 0:
                continue
            thisJ = np.roll(np.roll(tidxs, di, axis=0), dj, axis=1)
            thisD = np.roll(np.roll(tD, di, axis=0), dj, axis=1)
            thisD = np.maximum(thisD, tD)
            # Deal with boundaries
            boundary = ~np.isnan(thisD)
            thisI = tidxs[boundary]
            thisJ = thisJ[boundary]
            thisD = thisD[boundary]
            I = np.concatenate((I, thisI.flatten()))
            J = np.concatenate((J, thisJ.flatten()))
            V = np.concatenate((V, thisD.flatten()))
    sparseDM = sparse.coo_matrix((V, (I, J)), shape=(idxs.size, idxs.size))
    return sparseDM
ts = np.linspace(-1, 1, 100)
x1 = np.exp(-(ts ** 2) / (0.1 ** 2))
ts -= 0.4
x2 = np.exp(-(ts ** 2) / (0.1 ** 2))
img = -x1[None, :] * x1[:, None] - 2 * x1[None, :] * x2[:, None] - 3 * x2[None, :] * x2[:, None]
plt.imshow(img)
plt.show()
dm = lower_star_img(img)
vr_persistence = VietorisRipsPersistence(metric="precomputed", homology_dimensions=[0], n_jobs=6, infinity_values=0.5)
dgms = vr_persistence.fit_transform_plot(dm.toarray()[None, :, :])
import PIL
cells_original = plt.imread("../data/Cells.jpg")
cells_grey = np.asarray(PIL.Image.fromarray(cells_original).convert("L"))
plt.subplot(121)
plt.title(cells_original.shape)
plt.imshow(cells_original)
plt.axis("off")
plt.subplot(122)
plt.title(cells_grey.shape)
plt.imshow(cells_grey, cmap="gray")
plt.axis("off")
plt.show()
dm = lower_star_img(-cells_grey)
dm.toarray().shape
vr_persistence = VietorisRipsPersistence(metric="precomputed", homology_dimensions=[0], n_jobs=3)
dgms = vr_persistence.fit_transform_plot(dm.toarray()[None, :, :])
plt.scatter(dgms[0][:, 0], dgms[0][:, 1])
dgms
from gtda.homology import CubicalPersistence

cb = CubicalPersistence(homology_dimensions=[0, 1]).fit_transform_plot(cells_grey[None, :, :])
cb.shape
from scipy import ndimage
# Normalize to the range [0, 1]
cells_grey = -ndimage.uniform_filter(cells_original, size=10)
cells_grey = cells_grey - np.min(cells_grey)
cells_grey = cells_grey / np.max(cells_grey)
plt.imshow(cells_grey, cmap="afmhot")
plt.colorbar()
binary = Binarizer(threshold=0.5)
cells_binary = binary.fit_transform_plot(cells_grey[None, :, :])
cells_binary
plot_heatmap(cells_grey)
cells_grey
-np.ones(2)
hf = HeightFiltration(np.array([-1, -1]))
cells_grey[None, :, :].shape
hf.fit_transform(cells_binary)
filtration = hf.fit_transform_plot(cells_binary)
filtration.shape
filtration[0].shape
plt.imshow(filtration[0][:100])
plt.imshow(filtration[0], cmap="viridis")
plt.colorbar()
plt.scatter(persistence_diagrams[0][:, 0], persistence_diagrams[0][:, 1])
filtration.shape
I.shape
plt.scatter(np.mean(filtration[0][0]), np.mean(filtration[0][1]))
filtration[0].shape
# track connected components, loops, and voids
homology_dimensions = [0]
persistence = VietorisRipsPersistence(metric="euclidean", homology_dimensions=homology_dimensions, n_jobs=6)
%time persistence_diagrams = persistence.fit_transform(filtration)
persistence_diagrams[0]
I = persistence_diagrams[0]
# I = I[I[:, 1]-I[:, 0] > 0.001, :]
Inorm = (I - np.min(I)) / np.max(I)
Inorm
I.shape
plot_diagram(I)
cells_binary.shape
F = cells_grey + 0.001 * np.random.randn(cells_grey.shape[0], cells_grey.shape[1])
# I = HeightFiltration(direction=-np.ones(2)).fit_transform(F[None, :, :])
# I = I[I[:, :, 1]-I[:, :, 0] > 0.001, :] # Filter out low persistence values
X, Y = np.meshgrid(np.arange(F.shape[1]), np.arange(F.shape[0]))
X = X.flatten()
Y = Y.flatten()
Inorm.shape
fig, ax = plt.subplots()
cutoff = 0.9
idxs = np.arange(Inorm.shape[0])
idxs = idxs[Inorm[:, 1] > cutoff]
ax.imshow(cells_original)
for idx in idxs:
    bidx = np.argmin(np.abs(F - Inorm[idx, 1]))
    ax.scatter(X[bidx], Y[bidx], 20, "k")
I[0]
I.shape
idxs.shape
F.shape
I[0]
I[1]
np.max(F)
plt.figure()
ts = np.linspace(-1, 1, 100)
x1 = np.exp(-(ts ** 2) / (0.1 ** 2))
ts -= 0.4
x2 = np.exp(-(ts ** 2) / (0.1 ** 2))
# Define depths of Gaussian blobs
h1 = -1
h2 = -2
h3 = -3
img = h1 * x1[None, :] * x1[:, None] + h2 * x1[None, :] * x2[:, None] + h3 * x2[None, :] * x2[:, None]
plt.imshow(img, cmap="afmhot")
plt.colorbar()
plt.show()
img_scaled = (img - np.min(img)) / (np.max(img) - np.min(img))
img_bin = Binarizer().fit_transform_plot(img_scaled[None, :, :])
filtration = HeightFiltration(np.array([-1, -1])).fit_transform_plot(img_bin)
filtration.shape
from gtda.homology import CubicalPersistence

dgms = CubicalPersistence(homology_dimensions=[0, 1]).fit_transform_plot(filtration)
from gtda.diagrams import *
Scaler().fit_transform_plot(dgms)